1- INTRODUCTION


1-1 Symbolic models / connectionist models

Artificial Intelligence, historically

2 goals:
- engineering: build intelligent systems
- cognitive science: produce a theory of intelligence

- in the 50's:
    theory of intelligence via:
        automata theory
        adaptive machines (perceptron)

- in the 70's:
    engineering:
        expert systems
        natural language understanding
        vision
        speech ...

- but today:
    problem of knowledge acquisition
The traditional opposition between:


Symbolic AI        Connectionist networks


logic              analog
sequential         parallel
discrete           continuous
local              distributed
programming        adaptation
high level         low level


1-2 Knowledge representation 


- In Artificial Intelligence:
    - a Prolog program
    - a semantic network

- In connectionist models:
    - local representation
    - distributed representation
1-3 Inference mechanisms

- In Artificial Intelligence:
    - logic (Prolog)
    - inheritance mechanisms

- In connectionist models:
    - dynamics

1-4 Learning

- In Artificial Intelligence:
    - symbolic: learning deals with rules

- In connectionist models:
    - numerical: learning occurs on synaptic weights
In the future:


small, highly specialized connectionist modules,
possibly controlled by symbolic systems


[Figure: a symbolic controller (induction, deduction, explanations, conceptual representation) on top of connectionist modules (learning, representation, generation, noise reduction, analog data)]
1-5 Connectionism today

The domain is rapidly expanding, mainly in the US and Europe.

3 lines of work:
- theoretical work
- realization of prototype applications
- development of dedicated hardware

Real-size industrial applications are starting, mainly in the domain of perception:
- vision
- speech
- signal
- automatic diagnosis

and development of "neural" hardware:
- chips, boards
- machines
APPLICATIONS SURVEY

NEURAL NETWORK SIMULATION HARDWARE (DARPA STUDY)
[Figure: simulation hardware (workstations, computers, processors) compared by number of interconnects and interconnects per second]

POTENTIAL APPLICATIONS (DARPA STUDY)
[Figure: required computing power of potential applications]

TECHNOLOGICAL DEVELOPMENTS
2- BASIC MATHEMATICAL TOOLS

2-1 Definitions

- An automaton is characterized by:
    an internal state $s \in S$
    input signals: $s_1, \ldots, s_n$
    a state transition function: $s = f(s_1, \ldots, s_n)$

- Examples:
    $S = \{0, 1\}$
    $S = \{0, 1, \ldots, 5\}$
2-2 Automata network


An automata network is a set of interconnected automata.


A network is fully characterized by: 


- the number of automata 
- their interconnection architecture hardware 
- the interconnection weights software 


- the transition functions memory 
2-3 Different classes of automata


We will use only generalized linear automata:


$s_j = f(A_j)$ with $A_j = \sum_k W_{jk} s_k$


where the transition function $f$ is either a threshold or a sigmoid.
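As an illustration only, a generalized linear automaton can be sketched in a few lines. The state coding (+1/-1 for the threshold, tanh for the sigmoid) and the numerical weights are assumptions made for this example; the slides do not fix them.

```python
import math

def threshold(a):
    """Threshold transition function (assumed coding: +1 if a >= 0, else -1)."""
    return 1 if a >= 0 else -1

def sigmoid(a):
    """Sigmoid transition function (assumed here to be tanh, values in (-1, 1))."""
    return math.tanh(a)

def next_state(weights_j, states, f):
    """State of automaton j: s_j = f(A_j), with A_j = sum_k W_jk * s_k."""
    a_j = sum(w * s for w, s in zip(weights_j, states))
    return f(a_j)

states = [1, -1, 1]              # states s_k of the connected automata
weights_j = [0.5, 0.25, -1.0]    # hypothetical weights W_jk, so A_j = -0.75
s_threshold = next_state(weights_j, states, threshold)
s_sigmoid = next_state(weights_j, states, sigmoid)
```

Both automata compute the same weighted sum; only the transition function differs.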
2-4 Propagating activity


- One-shot computing:

    Data --> NETWORK --> Output


- Computing with fixed points:

    The dynamics of the network is characterized by an equation:

    $s(t+1) = F[s(t)]$
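Computing with fixed points can be sketched as follows: the update $s(t+1) = F[s(t)]$ is repeated until the state no longer changes. The synchronous update, the threshold units and the small 3-unit weight matrix are assumptions of this example; a synchronous iteration is not guaranteed to reach a fixed point in general, hence the iteration limit.

```python
def step(weights, state):
    """One synchronous update s(t+1) = F[s(t)] with threshold units (+1/-1)."""
    return [1 if sum(w * s for w, s in zip(row, state)) >= 0 else -1
            for row in weights]

def run_to_fixed_point(weights, state, max_iters=100):
    """Iterate F until s(t+1) == s(t); return (state, t), or (None, max_iters)."""
    for t in range(max_iters):
        new_state = step(weights, state)
        if new_state == state:
            return state, t
        state = new_state
    return None, max_iters

# Hypothetical 3-unit network whose weights store the pattern [1, -1, 1]
pattern = [1, -1, 1]
weights = [[0 if i == j else pattern[i] * pattern[j] for j in range(3)]
           for i in range(3)]
fixed, iters = run_to_fixed_point(weights, [1, 1, 1])
```

Starting from [1, 1, 1], the state moves to the stored pattern, which is then left unchanged by F.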
3- ADAPTIVE SYSTEMS

Definition

Systems which are able to adjust their structure to improve their performance (according to a desired criterion) through a training process.

[Figure: a processor maps the input x to the output y; the error e between y and the desired output d drives an adaptive algorithm that adjusts the processor]

Figures from "Layered Neural Nets for Pattern Recognition", Widrow, Winter, Baxter, in IEEE ASSP Special Issue on Neural Networks (1988), and from "Adaptive Signal Processing", Widrow and Stearns, Prentice Hall Signal Processing Series, 1985.
3-1 Adaline (Widrow)

[Figure: Adaline. The input pattern $(x_{1k}, \ldots, x_{nk})$ is weighted and summed to give the analog output, which a threshold device turns into the output; the desired response input (training signal) drives the adaptation]

The processor is an Adaptive Linear Combiner (A.L.C.).

There are:
- an input signal vector with elements $x_1, \ldots, x_n$
- parameters that will be named weights
- a summing unit giving an answer $y$, which is the discriminant function:
    $y(x) = W^T x + W_0$, where $W_0$ is a threshold.
Adapting the system means learning and updating $W$ through example vectors $x_k$ from a learning set.

The linearity of the processor gives this adaptive system its name: Adaptive Linear Element, or Adaline.


Choice of a criterion

Most of the time, a quadratic error function $C$ is chosen as the cost function to be minimized:

$C(W) = \frac{1}{2n} \sum_k [W^T x_k - d_k]^2$

where $n$ is the number of learning set vectors and $d_k$ is the desired output for $x_k$ (the threshold $W_0$ is treated as one more weight, on a constant input $x_0 = +1$).
Choice of an adaptive algorithm

Minimizing $C(W)$, that is, computing $W^*$ such that $C(W^*)$ is minimum, is classically done through a gradient technique:

- starting from $W(0)$
- $W(t)$ is iteratively computed by:

    $W(t+1) = W(t) - \varepsilon(t) \nabla C[W(t)]$

$\varepsilon(t)$ is the iteration step and $\nabla C$ is the gradient of $C$.
In the quadratic case:

$W(t+1) = W(t) - \varepsilon(t) \sum_k [W(t)^T x_k - d_k] x_k$

where the constant factor $1/n$ of the gradient is absorbed into the step $\varepsilon(t)$,

or, in the stochastic gradient version (Widrow-Hoff rule, 1959):

$W(k+1) = W(k) - \varepsilon(k) [W(k)^T x_k - d_k] x_k$
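The stochastic version of the rule can be sketched as below. This is a minimal illustration, not Widrow's implementation: the constant step `eps`, the number of passes, the random presentation order and the small linear training set are all assumptions of the example. The threshold $W_0$ is handled as a weight on a constant input $x_0 = +1$.

```python
import random

def widrow_hoff(samples, n_inputs, eps=0.05, epochs=200, seed=0):
    """Stochastic gradient on the quadratic cost:
    W(k+1) = W(k) - eps * (W.x_k - d_k) * x_k.
    Each x_k is extended with x_0 = +1 so that w[0] plays the role of W_0."""
    rng = random.Random(seed)
    w = [0.0] * (n_inputs + 1)
    data = [([1.0] + list(x), d) for x, d in samples]
    for _ in range(epochs):
        rng.shuffle(data)                       # present examples in random order
        for x, d in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - d
            w = [wi - eps * err * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical learning set: d = 2*x1 - x2 + 0.5 (exactly linear, so the error can vanish)
samples = [((x1, x2), 2 * x1 - x2 + 0.5) for x1 in (-1, 0, 1) for x2 in (-1, 0, 1)]
w = widrow_hoff(samples, n_inputs=2)            # approaches [0.5, 2.0, -1.0]
```

With a small enough step the weights settle on the values that generated the data; with noisy data they would instead fluctuate around the minimum of $C$.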
Interpretation of Adaline in a classification task

The discriminant function $y(x) = W^T x + W_0$ separates the pattern space into two classes:

class 1 = $\{x : y(x) \geq 0\}$
class 2 = $\{x : y(x) < 0\}$

Example in a 2-D pattern space: Adaline computes a separating line

$y = x_1 W_1 + x_2 W_2 + W_0 = 0$


NEURAL MODELS AND LEARNING ALGORITHMS © F. FOGELMAN SOULIE, M. De BOLLIVIER
SSGRR, Tutorial course on NEURAL NETWORKS, L'Aquila, May 15-19, 1989 (15/05/89)


[Figure: separating line in a 2-D pattern space, with bias input $x_0 = +1$; one of the plotted patterns is $(-1, +1)$]
Limitation

When the classes are not linearly separable, no single line can separate the input patterns.

Example:

[Figure: two classes of points (o and x) interleaved in the plane, which no single line can separate]
Madaline

In certain cases, it is possible to solve problems that are not linearly separable.

[Figure: two separating lines in the pattern space]

In this example, the input vectors can be separated with two separating lines. For that, Widrow used a two-neuron form of Madaline.

A Madaline is composed of a few Adalines in parallel.

[Figure: Madaline network and its output]
Madaline which computes non linear boundaries

The separating boundary is:

$y = W_0 + x_1 W_1 + x_1^2 W_{11} + x_1 x_2 W_{12} + x_2^2 W_{22} + x_2 W_2 = 0$

[Figure: separating boundary in the pattern space]
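To see why the quadratic terms help, the sketch below applies a linear discriminant to the preprocessed inputs $(x_1, x_1^2, x_1 x_2, x_2^2, x_2)$. The weights are chosen by hand for the example (only $W_{12}$ is non-zero) and the +1/-1 coding of the exclusive-or is an assumption; no learning is performed here.

```python
def quadratic_features(x1, x2):
    """Polynomial preprocessing: (x1, x1^2, x1*x2, x2^2, x2)."""
    return [x1, x1 * x1, x1 * x2, x2 * x2, x2]

def discriminant(w0, w, x1, x2):
    """y = W0 + x1*W1 + x1^2*W11 + x1*x2*W12 + x2^2*W22 + x2*W2."""
    return w0 + sum(wi * fi for wi, fi in zip(w, quadratic_features(x1, x2)))

# Hand-chosen (hypothetical) weights: W12 = -1, all the others 0
w0, w = 0.0, [0.0, 0.0, -1.0, 0.0, 0.0]

# Exclusive-or with +1/-1 coding: output +1 when the two inputs differ
xor_points = {(-1, -1): -1, (-1, 1): 1, (1, -1): 1, (1, 1): -1}
outputs = {p: (1 if discriminant(w0, w, *p) >= 0 else -1) for p in xor_points}
```

The boundary $x_1 x_2 = 0$ is not a line in the original pattern space, but it is linear in the preprocessed features, so a single Adaline on those features is enough.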
Adaline application: adaptive interference canceling

[Figure: adaptive noise canceler. The signal source plus noise, $s + n_0$, enters the primary input; the noise source feeds the reference input of the adaptive filter, whose output $y$ is subtracted from the primary input to give the system output]

The system output is $e = s + n_0 - y$. Widrow's aim was to minimize the output power $E(e^2)$, which amounts to minimizing the noise power in the output.
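A small simulation gives an idea of the principle. Everything numerical here is an assumption made for the illustration (sinusoidal signal, Gaussian noise source, a two-tap path from the noise source to the primary input, 4 adaptive weights, step 0.01); only the structure, with $e = s + n_0 - y$ driving a Widrow-Hoff update, follows the slide.

```python
import math
import random

def lms_noise_canceller(primary, reference, n_taps=4, eps=0.01):
    """y is an adaptive linear combination of recent reference samples;
    the system output e = primary - y drives the weight update."""
    w = [0.0] * n_taps
    out = []
    for t in range(len(primary)):
        x = [reference[t - i] if t - i >= 0 else 0.0 for i in range(n_taps)]
        y = sum(wi * xi for wi, xi in zip(w, x))
        e = primary[t] - y                                   # e = s + n0 - y
        w = [wi + eps * e * xi for wi, xi in zip(w, x)]      # Widrow-Hoff step
        out.append(e)
    return out

def power(xs):
    return sum(x * x for x in xs) / len(xs)

rng = random.Random(1)
n = 5000
signal = [math.sin(0.05 * t) for t in range(n)]              # s
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]              # noise source = reference input
n0 = [0.8 * noise[t] + (0.4 * noise[t - 1] if t else 0.0) for t in range(n)]
primary = [s + v for s, v in zip(signal, n0)]                # primary input s + n0
output = lms_noise_canceller(primary, noise)

# Residual noise power over the last 1000 samples, before and after cancelling
residual_before = power([p - s for p, s in zip(primary[-1000:], signal[-1000:])])
residual_after = power([e - s for e, s in zip(output[-1000:], signal[-1000:])])
```

Because the signal is uncorrelated with the reference input, minimizing the output power can only remove the noise component, which is why `residual_after` ends up much smaller than `residual_before`.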
Cancelling the maternal ECG in fetal electrocardiography

To cancel the noise, Widrow proceeded as follows:

- 4 chest leads were used to record the mother's heartbeat and provide multiple reference inputs (learning set elements).

- a single abdominal lead was used to record the combined fetal and maternal heartbeats, which served as the primary input.
[Figure 12.18: Canceling maternal heartbeat in fetal electrocardiography: (a) cardiac electric field vectors of mother and fetus; (b) placement of leads (chest leads, abdominal lead placements, neutral electrode). From B. Widrow et al., Adaptive Noise Canceling: Principles and Applications, © December 1975, IEEE.]

[Figure 12.19: Multiple-reference noise canceler used in fetal ECG experiment: the abdominal lead is the primary input and the chest leads are the reference inputs. From B. Widrow et al., Adaptive Noise Canceling: Principles and Applications, © December 1975, IEEE.]
[Figure 12.20: Result of fetal ECG experiment (bandwidth, 3-35 Hz; sampling rate, 256 Hz): (a) reference input (chest lead); (b) primary input (abdominal lead); (c) noise-canceler output. From B. Widrow et al., Adaptive Noise Canceling: Principles and Applications, © December 1975, IEEE.]
[Figure 12.21: Result of wide-band fetal ECG experiment (bandwidth, 0.3-75 Hz; sampling rate, 512 Hz): (a) reference input (chest lead); (b) primary input (abdominal lead); (c) noise canceller output. From B. Widrow et al., Adaptive Noise Canceling: Principles and Applications, © December 1975, IEEE.]
[Figure 12.22: Canceling noise in speech signals: the interference feeds the reference input of the adaptive noise canceler. From B. Widrow et al., Adaptive Noise Canceling: Principles and Applications, © December 1975, IEEE.]

[Figure 12.23: Typical learning curve for speech noise-canceling experiment (output power versus number of adaptations, in hundreds). From B. Widrow et al., Adaptive Noise Canceling: Principles and Applications, © December 1975, IEEE.]
Conclusion

The limitation of Adaline is due to its linearity.

But many of the tools now used in connectionism were developed with it:
- the mean-square error criterion
- the Widrow-Hoff adaptive algorithm

For 20 years, Widrow has obtained good results with this sort of method on real-world problems such as echo cancellation in telephony and adaptive antennas.
3-2 Perceptron

3-2-1 The network

[Figure: a retina feeding association cells, whose outputs feed a decision unit]

Each association cell computes: $f_i : S^n \to S$

The decision unit computes: $F = 1[\sum_i W_i f_i - \theta]$

The decision unit is thus an Adaline.

Let CLASS 1 = $\{x : F[x] = 1\}$
    CLASS 2 = $\{x : F[x] = -1\}$

A perceptron can be trained from examples to perform classification.
3-2-2 The perceptron learning algorithm

- start from $W_i(0)$
- at time $k$, choose an example $x_k$

    - if $x_k$ is correctly classified,
        i.e. $x_k \in$ CLASS1 and $\sum_i W_i f_i(x_k) - \theta \geq 0$
        or   $x_k \in$ CLASS2 and $\sum_i W_i f_i(x_k) - \theta < 0$
      then $W(k+1) \leftarrow W(k)$

    - if $x_k$ is misclassified,
        i.e. $x_k \in$ CLASS1 and $\sum_i W_i f_i(x_k) - \theta < 0$
          then $W_i(k+1) \leftarrow W_i(k) + f_i[x_k]$
        or   $x_k \in$ CLASS2 and $\sum_i W_i f_i(x_k) - \theta \geq 0$
          then $W_i(k+1) \leftarrow W_i(k) - f_i[x_k]$
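The learning algorithm can be sketched as follows. The example task (a logical AND), the fixed threshold $\theta = 0$, the constant association cell and the cyclic presentation of examples are assumptions of this illustration.

```python
def perceptron_train(examples, n_features, theta=0.0, max_passes=100):
    """Perceptron rule on association-cell outputs.
    examples: list of (f, cls), f being the list of f_i(x) values, cls in {1, 2}."""
    w = [0.0] * n_features
    for _ in range(max_passes):
        changed = False
        for f, cls in examples:
            s = sum(wi * fi for wi, fi in zip(w, f)) - theta
            if cls == 1 and s < 0:             # CLASS1 example misclassified
                w = [wi + fi for wi, fi in zip(w, f)]
                changed = True
            elif cls == 2 and s >= 0:          # CLASS2 example misclassified
                w = [wi - fi for wi, fi in zip(w, f)]
                changed = True
        if not changed:                        # every example is correctly classified
            break
    return w

# Hypothetical task: CLASS1 = both inputs on. Association cells f = (1, x1, x2);
# the constant cell lets the weights absorb the threshold.
examples = [([1, 0, 0], 2), ([1, 0, 1], 2), ([1, 1, 0], 2), ([1, 1, 1], 1)]
w = perceptron_train(examples, n_features=3)
predicted = [1 if sum(wi * fi for wi, fi in zip(w, f)) >= 0 else 2
             for f, _ in examples]
```

On this separable task the weights stop changing after a few passes, as the convergence theorem leads one to expect.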


3-2-3 The perceptron convergence theorem

The perceptron learning algorithm converges:
- in a finite number of steps
- for any initial weight distribution
provided:
- F is "computable" with that perceptron
- all examples are presented infinitely often
3-2-4 Limits of the perceptron

The perceptron can compute only order 1 problems:

F is of order k if k is the smallest integer such that:
- for all $i$, $f_i$ depends on at most $k$ variables
- $F = 1[\sum_i W_i f_i - \theta]$

Example: boolean functions in 2 variables
The function EXCLUSIVE-OR (order 2) cannot be computed by an arbitrary perceptron:

the association units $f_i$ have to be chosen adequately


3-2-5 Limits of adaptive systems

- only linearly separable problems
- no invariance with respect to transformations
    e.g. translation, rotation...
- if the problem is not solvable, no approximate solutions
- no hard problems

To go beyond these limits:
- pre-processing or coding of data
- use of multiple decompositions: networks with hidden units
3-2-6 An example: learning the past tenses of English verbs
(RUMELHART, McCLELLAND)

In children, learning goes through 3 phases:
1- a small number of (very common) irregular verbs are used correctly
2- use of the linguistic "rule" ADD -ED, applied to a large number of regular AND irregular verbs
3- restoration of the ability to correctly conjugate irregular verbs, while maintaining the ability to apply the rule

AIM: build a program which will learn past tenses from examples

Use a perceptron:

Representation:
- phonological
- in Wickel features

[Figure: a coding network, an association network and a decoding network in sequence, compared with the retina, association cells and decision cells of the PERCEPTRON]

The network is a perceptron with:
- a coding layer
- an association layer
- a decoding layer

The association net is adaptive:
- input cells contain a transcription of the verb root
- output cells contain the verb in the past tense

The coding net transcribes the verb into a set of Wickel features.

Each cell changes state according to a probabilistic rule:

$P(x_j = 1) = 1 / [1 + \exp(-(A_j - \theta_j)/T)]$

where $A_j$ is the weighted input of cell $j$, $\theta_j$ its threshold and $T$ is the "temperature".

This is thus practically identical to the perceptron, except for the use of probabilistic units.
The learning rule is the PERCEPTRON rule.

SIMULATIONS:
3 classes of verbs:
- 10 very common:
    come, get, give, look, take, go, have, live, feel
    (8 irregular + 2 regular)
    learnt first, with 10 presentations each.
- 410 relatively common:
    334 regular + 76 irregular,
    presented 190 times each together with the first 10.
- 86 rare: 14 irregular + 72 regular,
    used for testing.
RESULTS

- the first 10 verbs are correctly learnt.

- the introduction of the 410 new verbs destroys performance on the first 10:
    the machine treats them as if they were regular.

After the learning session is completed:
- performance on the 2 learning sets is practically perfect
- on the test set of 86 verbs:
    - 48 out of the 66 regular correct
    - on the 14 irregular, the algorithm seems to have discovered useful RULES, such as:
        weep --> wept
        cling --> clung
      for verbs never seen during the learning phase

Those rules have not been introduced EXPLICITLY into the system, but IMPLICITLY through examples.
4- ASSOCIATIVE MEMORY

Goal: realize a system for automatic association:

- complete missing information
- associate pieces of information (e.g. white - black, open - closed)

A key $x \in R^n$ allows retrieval of an item of information $y \in R^p$ (noise resistance).
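4-1 Orthogonal Projection

Let $L$ be the subspace spanned by the examples $x_1, \ldots, x_m$.

Then: $x = \hat{x} + \tilde{x}$

where $\hat{x} = \sum_{j=1,\ldots,m} \alpha_j x_j$ is the orthogonal projection of $x$ on $L$ and $\tilde{x}$ is orthogonal to $L$.

[Figure: decomposition of $x$ into $\hat{x}$ and $\tilde{x}$]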
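This method allows noise reduction:

$x = x_k + v$
$\Rightarrow \hat{x} = x_k + \hat{v}$
$\Rightarrow \|\hat{v}\| \leq \|v\|$

$\tilde{x} = x - \hat{x}$ can be interpreted as the novelty of $x$ with respect to the $x_k$.

[Figure: NOVELTY filter]

$\hat{x}$ is computed from $x$ through the Gram-Schmidt orthogonalization method:

$\tilde{x}_1 = x_1$
$\tilde{x}_k = x_k - \sum_{i=1,\ldots,k-1;\ \tilde{x}_i \neq 0} \frac{x_k \cdot \tilde{x}_i}{\|\tilde{x}_i\|^2} \tilde{x}_i$,  $k = 2, \ldots, m$
$\tilde{x} = \tilde{x}_{m+1}$ (taking $x_{m+1} = x$), and $\hat{x} = x - \tilde{x}$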
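The decomposition of a key into its projection and its novelty can be sketched as follows. The function names and the tolerance used to skip null vectors are assumptions of this example. The residual is updated after each subtraction (the "modified" form of Gram-Schmidt), which gives the same result as the formula above in exact arithmetic.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def residual(v, basis):
    """Subtract from v its components along the orthogonal vectors in basis."""
    r = [float(c) for c in v]
    for b in basis:
        c = dot(r, b) / dot(b, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return r

def orthogonalize(examples, tol=1e-12):
    """Gram-Schmidt on the examples; (near-)null vectors are skipped."""
    basis = []
    for v in examples:
        r = residual(v, basis)
        if dot(r, r) > tol:
            basis.append(r)
    return basis

def project(examples, x):
    """Return (x_hat, x_tilde): projection on span(examples) and novelty."""
    x_tilde = residual(x, orthogonalize(examples))
    x_hat = [xi - ti for xi, ti in zip(x, x_tilde)]
    return x_hat, x_tilde

examples = [[1, 0, 0], [1, 1, 0]]      # hypothetical stored examples
x_hat, x_tilde = project(examples, [2, 3, 4])
```

Here the stored examples span the plane of the first two coordinates, so the novelty of the key is its third component.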
4-2 Optimal associative mapping

The desired association is: $x_k \to y_k$
with $x_1, \ldots, x_m \in R^n$ and $y_1, \ldots, y_m \in R^p$

to be realized through a linear correspondence:
$y_k = W x_k$ for all $k \in \{1, \ldots, m\}$

i.e. $Y = W X$

Equation $Y = W X$ has a solution if and only if:

$Y = Y X^+ X$

where $X^+$ is the pseudo-inverse of matrix $X$.

The general solution is then given by:
$W = Y X^+ + Z (I - X X^+)$

where $Z$ is an arbitrary matrix with the same dimensions $p \times n$ as $W$.


Matrix $W$ can be approximated by noting that $W$ minimizes the cost function:

$C(W) = \sum_k \|W x_k - y_k\|^2$

hence an iterative algorithm to compute $W$:

$W(k+1) = W(k) - \varepsilon(k) [W(k) x_k - y_k] x_k^T$
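The iterative algorithm can be sketched as below, with matrices stored as lists of rows. The constant step, the number of sweeps, the cyclic presentation and the two key/output pairs are assumptions of this illustration.

```python
def train_associator(pairs, n, p, eps=0.1, sweeps=500):
    """W(k+1) = W(k) - eps * (W x_k - y_k) x_k^T, cycling through the pairs."""
    w = [[0.0] * n for _ in range(p)]
    for _ in range(sweeps):
        for x, y in pairs:
            err = [sum(w[i][j] * x[j] for j in range(n)) - y[i] for i in range(p)]
            for i in range(p):
                for j in range(n):
                    w[i][j] -= eps * err[i] * x[j]
    return w

def recall(w, x):
    """Linear retrieval y = W x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

# Hypothetical linearly independent keys in R^3, associated outputs in R^2
pairs = [([1.0, 0.0, 1.0], [1.0, 0.0]), ([0.0, 1.0, 1.0], [0.0, 1.0])]
w = train_associator(pairs, n=3, p=2)
recalled = [recall(w, x) for x, _ in pairs]
```

Since the keys are linearly independent, an exact solution of $Y = WX$ exists and the recalled outputs approach the desired ones.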
4-3 Examples

Associative memory for character recognition:

[Figure: capacity]

4-4 Brain-State-in-a-Box

$y = \mathrm{SATUR}[M x]$ with $M = \alpha W + \beta I$
where $W$ is the matrix of the optimal associative mapping.

The process is reiterated until stabilization.
4-5 Hopfield network and spin glasses

The output is thresholded:
$y_i = 1[\sum_j W_{ij} x_j - \theta_i]$,  $i = 1, \ldots, n$
i.e. $y = 1[W x - \theta]$ in matrix notation,
where: $W = X X^T$

This is known as Hebb's rule.

Remark:

The optimal associative mapping, in the case of auto-association, is:

$W = X X^+$

When the $x_k$'s are independent:
$X^+ = (X^T X)^{-1} X^T$
When the $x_k$'s are orthonormal: $X^T X = I$
In that case:
$W = X X^+ = X X^T$
which is just Hebb's rule.

Reiteration --> fixed points, because of the existence of an energy function.
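A small sketch of retrieval with Hebb's rule follows. Several choices are assumptions made for the example: +1/-1 states, zero thresholds, a zero diagonal in $W$ (the plain product $X X^T$ has a non-zero diagonal), asynchronous updates, and the energy $E(s) = -\frac{1}{2} \sum_{ij} W_{ij} s_i s_j$, which the slide mentions without writing down.

```python
def hebb_weights(patterns):
    """W = X X^T (sum of outer products of the stored patterns), diagonal set to 0."""
    n = len(patterns[0])
    return [[0 if i == j else sum(p[i] * p[j] for p in patterns) for j in range(n)]
            for i in range(n)]

def energy(w, s):
    """E(s) = -1/2 * sum_ij W_ij s_i s_j (thresholds assumed to be 0)."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def retrieve(w, s, max_sweeps=20):
    """Asynchronous threshold updates until no unit changes (a fixed point)."""
    s = list(s)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            new = 1 if sum(w[i][j] * s[j] for j in range(len(s))) >= 0 else -1
            if new != s[i]:
                s[i], changed = new, True
        if not changed:
            break
    return s

# Two hypothetical orthogonal patterns of 8 units
patterns = [[1, 1, 1, 1, -1, -1, -1, -1], [1, -1, 1, -1, 1, -1, 1, -1]]
w = hebb_weights(patterns)
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]        # first pattern with one unit flipped
restored = retrieve(w, noisy)
```

The flipped unit is corrected and the energy decreases from the noisy key to the stored pattern, which is a fixed point of the dynamics.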
4-6 Other retrieval schemes

When $W$ has been learnt from examples, the network can be tested in retrieval mode:

$x \to y = f(x)$

where $f$ can be any transition function
5- OTHER MODELS

5-1 Simulated annealing

Statistical mechanics:
a theory to find properties of large ensembles of molecules,
more precisely: the most probable states of the molecules.

Each configuration $s$ is characterized by its energy $E(s)$, and $s$ occurs with probability (up to a normalization factor):

$P(s) = \exp[-E(s) / kT]$

where $k$ is the Boltzmann constant and $T$ is the system temperature.

The most probable states are those with lowest energy.
How to find those states?

Metropolis (1953) proposed the following algorithm:

- at time $t$, pick a molecule $i$ at random
- try to modify its state from $s_i$ to $s_i + \Delta s_i$
- evaluate the variation of energy:

    $\Delta E = E(s_1, \ldots, s_i + \Delta s_i, \ldots, s_N) - E(s_1, \ldots, s_i, \ldots, s_N)$

- if $\Delta E \leq 0$, then accept the modification $s_i \to s_i + \Delta s_i$
- if $\Delta E > 0$, then accept the modification $s_i \to s_i + \Delta s_i$ with probability
    $P(\Delta E) = \exp(-\Delta E / kT)$

(if $T$ is large, $P(\Delta E) \approx 1$; if $T$ is small, $P(\Delta E) \approx 0$)
Simulated annealing is a Metropolis algorithm where $T$ is allowed to vary:

- at the beginning, $T$ is large
- $T$ is slowly decreased
- at the end, $T$ is almost 0
    --> minimal energy

It is a method to escape local minima.

[Figure: energy as a function of position in state space]

Applications:
- chip placement (Kirkpatrick)
- combinatorial optimization
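The Metropolis acceptance rule and the slow decrease of $T$ can be sketched as follows. The geometric cooling schedule, the constant $k = 1$, the one-dimensional energy landscape and the random moves are assumptions of this example; the slides only require that $T$ decrease slowly.

```python
import math
import random

def simulated_annealing(energy, neighbor, state, t_start=10.0, t_end=0.01,
                        cooling=0.99, seed=0):
    """Metropolis moves while the temperature T is slowly decreased (k = 1)."""
    rng = random.Random(seed)
    e = energy(state)
    best, best_e = state, e
    t = t_start
    while t > t_end:
        cand = neighbor(state, rng)
        cand_e = energy(cand)
        delta = cand_e - e
        # accept if the energy decreases, else with probability exp(-delta / T)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            state, e = cand, cand_e
            if e < best_e:
                best, best_e = state, e
        t *= cooling                      # assumed geometric cooling schedule
    return best, best_e

# Hypothetical landscape: local minimum near x = 3, lower minimum near x = -2
def landscape(x):
    return 0.1 * (x + 2) ** 2 * (x - 3) ** 2 + 0.5 * x

def random_move(x, rng):
    return x + rng.uniform(-1.0, 1.0)

best_x, best_e = simulated_annealing(landscape, random_move, state=3.0)
```

At high temperature uphill moves are often accepted, which lets the state leave the local minimum in which it starts; as $T$ approaches 0 the procedure behaves like a plain descent.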
5-2 Boltzmann Machine

A probabilistic automaton is defined by a probability distribution:

$P(s_i = 1) = 1 / [1 + \exp(-\Delta E_i / T)]$

where $T > 0$ is a parameter called the temperature and $E$ is the energy of the network:

$E(s) = -\sum_{i,h:\ i<h} W_{ih} s_i s_h + \sum_i \theta_i s_i$

Then: $\Delta E_i = \sum_h W_{ih} s_h - \theta_i$

The probability of a configuration $\alpha$ of the network is then given by:

$P_\alpha = \exp(-E_\alpha / T) / \sum_\beta \exp(-E_\beta / T)$
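For a very small network, these formulas can be checked by enumerating all configurations. The 3-unit weights and thresholds below are hypothetical, and states are taken in {0, 1}; the sketch only illustrates how the unit rule and the configuration probabilities fit together.

```python
import math
from itertools import product

def energy(w, theta, s):
    """E(s) = - sum_{i<h} W_ih s_i s_h + sum_i theta_i s_i, with s_i in {0, 1}."""
    n = len(s)
    pair = sum(w[i][h] * s[i] * s[h] for i in range(n) for h in range(i + 1, n))
    return -pair + sum(t * si for t, si in zip(theta, s))

def p_on(w, theta, s, i, temp):
    """P(s_i = 1) = 1 / (1 + exp(-dE_i / T)), dE_i = sum_h W_ih s_h - theta_i."""
    d_e = sum(w[i][h] * s[h] for h in range(len(s)) if h != i) - theta[i]
    return 1.0 / (1.0 + math.exp(-d_e / temp))

def boltzmann_distribution(w, theta, temp):
    """P_alpha = exp(-E_alpha / T) / sum_beta exp(-E_beta / T)."""
    configs = list(product((0, 1), repeat=len(theta)))
    weights = [math.exp(-energy(w, theta, s) / temp) for s in configs]
    z = sum(weights)
    return {s: wt / z for s, wt in zip(configs, weights)}

w = [[0, 2, -1], [2, 0, 1], [-1, 1, 0]]     # hypothetical symmetric weights
theta = [0.5, 0.0, -0.5]                    # hypothetical thresholds
dist = boltzmann_distribution(w, theta, temp=1.0)

# Probability that unit 0 is on, given the other units in state (1, 0)
conditional = dist[(1, 1, 0)] / (dist[(1, 1, 0)] + dist[(0, 1, 0)])
```

The conditional probability read from the configuration distribution coincides with the unit rule `p_on(w, theta, [0, 1, 0], 0, 1.0)`, since $\Delta E_i$ is the energy difference between the states $s_i = 0$ and $s_i = 1$.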
A Boltzmann Machine has units which are probabilistic automata:

- a set of visible units V
- a set of hidden units C

[Figure: a layer of visible units connected to a layer of hidden units]

The Machine is used in two modes:

- in forced mode, the states of the visible units V are held constant.
  For each state $V_\alpha$ of the visible units, the network reaches an equilibrium state $(V_\alpha, C_\beta)$

- in free mode, all units are allowed to evolve freely.
  For an initial state $V_\alpha$ of the visible units, the network reaches an equilibrium state $(V'_\alpha, C'_\beta)$
The network will be said to have learnt a model of the environment if:

$P(V_\alpha) = P(V'_\alpha)$

where $P(V)$ denotes the state distribution of the visible units in forced mode and $P(V')$ the state distribution of the visible units in free mode.

In practice, try to minimize the distance between those distributions:

$G = \sum_\alpha P(V_\alpha) \ln [P(V_\alpha) / P(V'_\alpha)]$

A gradient technique to minimize this distance leads to modifying the weights by:

$W_{ih}(t+1) = W_{ih}(t) + \varepsilon (p_{ih} - p'_{ih})$

where $p_{ih}$ denotes the average probability that $s_i = s_h = 1$ in forced mode, and $p'_{ih}$ the same average probability in free mode.

$p_{ih}$ and $p'_{ih}$ are in fact computed as temporal averages instead of state averages (ergodicity).
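The quantities involved in the learning rule can be sketched as follows. The sampled states are hypothetical data standing for states recorded over time in each mode; no actual Boltzmann Machine is simulated here, and the step `eps` is an assumption.

```python
import math

def divergence(p_forced, p_free):
    """G = sum_alpha P(V_alpha) * ln(P(V_alpha) / P(V'_alpha)).
    Terms with P(V_alpha) = 0 contribute 0; P(V'_alpha) is assumed > 0 elsewhere."""
    return sum(p * math.log(p / q) for p, q in zip(p_forced, p_free) if p > 0)

def co_occurrence(samples, i, h):
    """Temporal average of s_i * s_h over a list of sampled states (units in {0, 1})."""
    return sum(s[i] * s[h] for s in samples) / len(samples)

def update_weight(w_ih, p_ih, p_free_ih, eps=0.1):
    """W_ih(t+1) = W_ih(t) + eps * (p_ih - p'_ih)."""
    return w_ih + eps * (p_ih - p_free_ih)

# Hypothetical states of 2 units recorded in each mode
forced = [(1, 1), (1, 1), (0, 0), (1, 1)]
free = [(1, 0), (1, 1), (0, 0), (0, 1)]
p_ih = co_occurrence(forced, 0, 1)          # 0.75
p_free_ih = co_occurrence(free, 0, 1)       # 0.25
w_new = update_weight(0.0, p_ih, p_free_ih)

g_equal = divergence([0.5, 0.5], [0.5, 0.5])
g_diff = divergence([0.5, 0.5], [0.25, 0.75])
```

The weight is increased when the two units are on together more often in forced mode than in free mode, and $G$ vanishes only when the two distributions coincide.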
Examples:

Learning a XOR:
- 96% success in 20 iterations with T = 10
- about 100% success in 255 iterations with T = 4
Learning symmetries

Problem: recognize whether a figure has a symmetry axis

output = 100 if horizontal axis of symmetry
         010 if vertical axis of symmetry
         001 if diagonal axis of symmetry

[Figure: network with 3 output units (visible), 12 hidden units and an n x n input grid (visible)]
Other applications:

- image: separating figure from ground, translation invariance...
- speech

but computing time!
    necessity of estimating the $p_{ih}$ and $p'_{ih}$ at each step
5-3 Hopfield networks

- Define an ad-hoc energy for an optimization problem
- Run the dynamics
- until a fixed point is reached

With simulated annealing, this fixed point will be a minimum of energy.

Example: Traveling salesman problem

For 6 cities: A, B, C, D, E, F, a tour is represented in a matrix (rows: cities, columns: ranks in the tour):

     1  2  3  4  5  6
A    0  1  0  0  0  0
B    0  0  0  1  0  0
C    1  0  0  0  0  0        tour: CAEBDF
D    0  0  0  0  1  0
E    0  0  1  0  0  0
F    0  0  0  0  0  1
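The matrix coding of a tour can be sketched as follows. The function names are hypothetical; the coding itself (one row per city, one column per rank) is the one of the example above.

```python
def tour_to_matrix(tour, cities):
    """Rows are cities, columns are ranks: 1 if the city has that rank in the tour."""
    return {c: [1 if tour[r] == c else 0 for r in range(len(tour))] for c in cities}

def matrix_to_tour(matrix):
    """Decode a valid matrix (one 1 per row and per column) back into a tour."""
    n = len(matrix)
    tour = [None] * n
    for city, row in matrix.items():
        if sum(row) != 1:
            raise ValueError("city %s does not have exactly one rank" % city)
        tour[row.index(1)] = city
    if None in tour:
        raise ValueError("some rank is not used exactly once")
    return "".join(tour)

cities = "ABCDEF"
matrix = tour_to_matrix("CAEBDF", cities)
decoded = matrix_to_tour(matrix)
```

A network state is a valid tour only if each row and each column contains exactly one 1, which is what the penalty terms of the energy function are meant to enforce.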
With the energy function:

$E = \frac{\alpha}{2} \sum_v \sum_i \sum_{h \neq i} x_{vi} x_{vh}$        (1 if town $v$ has 2 ranks $i$, $h$)
$\ + \frac{\beta}{2} \sum_i \sum_v \sum_{w \neq v} x_{vi} x_{wi}$        (1 if towns $v$, $w$ have the same rank $i$)
$\ + \frac{\gamma}{2} (\sum_v \sum_i x_{vi} - n)^2$        (0 if each town has exactly 1 rank)
$\ + \frac{\delta}{2} \sum_v \sum_{w \neq v} \sum_i d_{vw} x_{vi} (x_{w,i+1} + x_{w,i-1})$        (minimum for shortest tour)

where $x_{vi} = 1$ if town $v$ has rank $i$ in the tour, $d_{vw}$ is the distance between towns $v$ and $w$, and $n$ is the number of towns,

and the allowable changes of states:
    modification of the ranks of 2 cities:
    ABCDEF   C <--> F   ABFDEC

Hopfield's method:
- gives approximate solutions
- computation time is long (because of simulated annealing)
- the number of elements in the network varies as $n^2$
5-4 Elastic net (R. Durbin and D. Willshaw)

- M cities, indexed by k = 1...M
- N nodes on a circular network, indexed by j = 1...N
- The algorithm:
    + select city k, k = 1,...,M
    + for each node j, move j:
        $y_j(t+1) = y_j(t) + \Delta y_j$
      with
        $\Delta y_j = \alpha \sum_k w_{kj} (x_k - y_j) + \beta K (y_{j+1} - 2 y_j + y_{j-1})$
      (first term: force towards cities; second term: force towards neighbors on the net)
      where
        $w_{kj} = f(|x_k - y_j|, K) / \sum_l f(|x_k - y_l|, K)$
      (numerator: influence of city k on j; normalisation: the total influence of city k is the same for all k)

where
    $y_j(t)$ is the coordinate vector of node j at time t
    $x_k$ is the coordinate vector of city k
with:
    $f(d, K) = \exp(-d^2 / 2K^2)$
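One update of the elastic net can be sketched as follows. The four cities, the ring of 10 nodes, the number of steps and the constant value of K are assumptions of this example, as are the default values chosen for alpha and beta; for very small K the normalisation sum can underflow, which this sketch does not guard against.

```python
import math

def elastic_net_step(cities, nodes, k_param, alpha=0.2, beta=2.0):
    """Move every node y_j of a closed ring once (2-D points as tuples)."""
    n = len(nodes)

    def f(d, k):
        return math.exp(-d * d / (2.0 * k * k))

    # w[k][j]: influence of city k on node j, normalised over the nodes
    w = []
    for x in cities:
        row = [f(math.dist(x, y), k_param) for y in nodes]
        total = sum(row)
        w.append([r / total for r in row])

    new_nodes = []
    for j, y in enumerate(nodes):
        prev, nxt = nodes[j - 1], nodes[(j + 1) % n]
        move = []
        for c in range(2):
            pull = sum(w[k][j] * (x[c] - y[c]) for k, x in enumerate(cities))
            tension = nxt[c] - 2.0 * y[c] + prev[c]
            move.append(alpha * pull + beta * k_param * tension)
        new_nodes.append((y[0] + move[0], y[1] + move[1]))
    return new_nodes

def mean_gap(cities, nodes):
    """Average distance from each city to its nearest node."""
    return sum(min(math.dist(x, y) for y in nodes) for x in cities) / len(cities)

cities = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]      # hypothetical cities
nodes = [(0.5 + 0.1 * math.cos(2 * math.pi * j / 10),
          0.5 + 0.1 * math.sin(2 * math.pi * j / 10)) for j in range(10)]
gap_before = mean_gap(cities, nodes)
for _ in range(30):
    nodes = elastic_net_step(cities, nodes, k_param=0.2)
gap_after = mean_gap(cities, nodes)
```

With K held constant the ring only stretches part of the way towards the cities; in the method itself K is gradually lowered so that the path ends up passing through every city.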
It is possible to show that:
$\Delta y_j = -K \, \partial E / \partial y_j$
with:
$E = -\alpha K \sum_k \ln \sum_j f(|x_k - y_j|, K) + \frac{\beta}{2} \sum_j |y_{j+1} - y_j|^2$

Hence, if $K \to 0$ and $N \to \infty$, then the minima of $E$ are (true!) optimal solutions of the Traveling Salesman Problem.


[Figure, from Durbin and Willshaw: example of the progress of the elastic net method for 100 cities randomly distributed in the unit square: the initial path (a small ring about the centroid of the cities), paths generated as the value of K was gradually lowered, the tour of length 7.78 deduced from the final configuration, and the shortest tour found by the authors by any method (simulated annealing), of length 7.70.]


6- TOPOLOGICAL MAPS

6-1 The algorithm

The brain is organized in regions which correspond to different sensory modalities.

In each region, the topological structure (in the cortex) is the same as the topological structure of the sensors.

One thus has feature maps in the brain: retinotopic, somatosensory... maps

[Figure: somatosensory map; labels include foot, stomach, chest, shoulder, forearm, hand, 5th finger, ..., face, upper lip, lower lip]
The interesting property here is the following:

- given a data space, for example signals on the retina,
- how to build a representation of that space in another space (the feature space) so that similar data have similar representations?

Goal: find a mapping between two spaces (data and representations) which preserves topology.
The topological maps model (Kohonen) gives a mechanism to achieve
that:


- representations are coded in a network's weights
- the mapping is learnt by modifying the weights


- the learning process is unsupervised: the representations
associated with each example do not have to be known a priori.


- at the end, each cell "represents" a class.


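The algorithm:

- Present an example $x^k = (x_1^k, \dots, x_n^k)$

- Choose the nearest cell $i(k)$:
  $\| x^k - W_{i(k)} \| = \min_i \| x^k - W_i \|$

- Modify the weights for all cells in $N_k$:
  $\Delta W_i(k+1) = \varepsilon(k)\,[x^k - W_i(k)]$ if $i \in N_k$
  $\Delta W_i(k+1) = 0$ if $i \notin N_k$

The network:

[Figure: a 1-D output map of cells, each connected to the inputs $x_1$, $x_2$ (2-D data); $i(k)$ marks the winning cell.]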


In this algorithm, the weights of the nearest cell $i(k)$ and of all the
cells in the neighborhood $N_k$ of cell $i(k)$ are moved towards $x^k$.


$N_k$ varies in time:


- at the beginning, $N_k$ is relatively large


- $N_k$ decreases in time
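
The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in the course: the function name `train_map`, the linear decay of $\varepsilon(k)$, the linear shrinking of the neighborhood radius and all numerical values are assumptions chosen for the example.

```python
import numpy as np

def train_map(data, n_cells=10, n_steps=2000, eps0=0.5, radius0=3, seed=0):
    """Train a 1-D topological map on the rows of `data`; returns the weights."""
    rng = np.random.default_rng(seed)
    # One weight vector W_i per cell, drawn inside the bounding box of the data.
    weights = rng.uniform(data.min(0), data.max(0), size=(n_cells, data.shape[1]))
    for k in range(n_steps):
        frac = k / n_steps
        eps = eps0 * (1.0 - frac)                    # assumed linear decay of eps(k)
        radius = int(round(radius0 * (1.0 - frac)))  # N_k is large at first, then shrinks
        x = data[rng.integers(len(data))]            # present an example x^k
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))  # nearest cell i(k)
        lo, hi = max(0, winner - radius), min(n_cells, winner + radius + 1)
        weights[lo:hi] += eps * (x - weights[lo:hi])  # cells in N_k move towards x^k
    return weights

# Hypothetical 2-D data, uniformly spread over the unit square.
rng = np.random.default_rng(1)
data = rng.uniform(0.0, 1.0, size=(500, 2))
weights = train_map(data)
```

Cells outside $N_k$ are left untouched, which is the "0 otherwise" branch of the update rule.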
6-2 Applications
6-2-1 Speech
Phonemic maps are built by the above algorithm:


From continuous speech (in Finnish), speech spectra are computed in
15 frequency channels, every 9.83 ms.


Those frequency spectra are the data.
- a 2-D representation map of phonemes is built:
  -- a phoneme $x^k$ is presented
  -- the nearest cell $i(k)$ is computed
  -- and labelled with the corresponding phoneme


- result: phonemic map


[Figure: phonemic map; each cell is labelled with a phoneme.]


double label: neurons corresponding to 2 phonemes


+ 1 auxiliary map for /k,p,t/
A word (humppila) is viewed as a


- sequence of quasi-phonemes: a trajectory in the phonemic map
6-2-2 Traveling salesman problem (Angéniol)
- M cities labeled by k = 1...M
- A ring (the net) has a variable number of nodes, N, labeled by j.
There is a process to create and delete nodes.
- Algorithm description:
For each city k, k = 1,...,M
+ find the nearest node j = j(k)
+ move the nodes j towards k:
$W_j(t+1) = W_j(t) + f(T, n_j)\,(x_k - W_j(t))$
with
$W_j(t)$: coordinates of node j at time t
$x_k$: coordinates of city k
$n_j$: distance from j to j(k) along the ring:
$n_j = \inf[(j - j(k)) \bmod N,\ (j(k) - j) \bmod N]$
$f(T, n) = \exp(-n^2 / T^2) / \sqrt{2}$
+ temperature T varies after each step:
T := (1 - α) T
+ creation of nodes:


a node is duplicated if it has been chosen by two cities in a
step


+ deletion of nodes: if a node has not been chosen for 3
successive steps, it is deleted
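
As an illustration, the update rule and the temperature schedule can be sketched as follows. This is a simplified sketch under stated assumptions: the creation and deletion of nodes are left out, the helper names (`ring_distance`, `gain`, `survey`) are hypothetical, and the starting temperature and the numbers of cities and nodes are arbitrary.

```python
import numpy as np

def ring_distance(j, winner, n_nodes):
    """n_j = inf[(j - j(k)) mod N, (j(k) - j) mod N], measured along the ring."""
    return np.minimum((j - winner) % n_nodes, (winner - j) % n_nodes)

def gain(temperature, n):
    """f(T, n) = exp(-n^2 / T^2) / sqrt(2)."""
    return np.exp(-(n ** 2) / temperature ** 2) / np.sqrt(2.0)

def survey(nodes, cities, temperature):
    """Present every city once and move the ring nodes; returns the new nodes."""
    nodes = nodes.copy()
    n_nodes = len(nodes)
    idx = np.arange(n_nodes)
    for x in cities:
        winner = np.argmin(np.linalg.norm(nodes - x, axis=1))  # nearest node j(k)
        f = gain(temperature, ring_distance(idx, winner, n_nodes))
        nodes += f[:, None] * (x - nodes)      # W_j += f(T, n_j) (x_k - W_j)
    return nodes

rng = np.random.default_rng(0)
cities = rng.uniform(size=(30, 2))             # hypothetical city coordinates
nodes = rng.uniform(size=(10, 2))              # fixed ring size (no creation/deletion)
temperature, alpha = 5.0, 0.2                  # assumed starting temperature
for _ in range(20):
    nodes = survey(nodes, cities, temperature)
    temperature *= 1.0 - alpha                 # T := (1 - alpha) T
```

At high temperature every node follows each city; as T decreases, only the winning node and its closest ring neighbours still move.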
FIGURE 1. Evolution of the ring on a set of 30 cities used by Tank and Hopfield (1985). The solution shown here is the same as
Lin-Kernighan's, and happens with probability 1.5 × 10⁻³ for α = 0.2








[Figure: two histograms of tour length (horizontal axis, 4.2 to 4.7) against percentage of tries (vertical axis): α = 0.2 (20,000 tries) and α = 0.02 (2,400 tries).]


FIGURE 2. Two histograms showing how often each value of tour length was obtained for the 30 city problem, in 20,000 tries for
parameter α = 0.2, and in 2,400 tries for α = 0.02. Each try corresponds to a random but fixed order of cities for all surveys during
one simulation. The lowest, largest, and average lengths obtained are indicated. α = 0.2 gives an average length of 4.38 and
reaches the optimal length 4.267 (from the Lin-Kernighan tour) with probability 1.5 × 10⁻³ (31 times in 20,000 tries). Each try takes
less than 2 seconds on an Apollo workstation. The average value obtained after 40 tries is 4.31 (*), only 1% off optimum. α = 0.02
gives a better average of 4.36, but never reaches optimum, and each simulation lasts ten times longer than with α = 0.2. Lower
values of α do not enhance performance.





FIGURE 5. A set of 1000 randomly distributed cities. The solution proposed here was obtained with α = 0.01, to ensure a good
result in one try (path length = 18,036 in a 7,000 x 500 rectangle). It took 12 hours. Simulations with α = 0.2 take 20 minutes and
also give good results (lengths ranging from 18,200 to 18,800).
6-2-3 Image matching (Keller)
The problem: matching electrophoresis gels


The network


[Figure: the network; its inputs $x_1$, $x_2$ are taken from the data of the 2-D gel.]
The algorithm
- all weights are modified.


- $\varepsilon = \varepsilon(t,d) = I(t)\,\exp[-d^2 / (2\,s(t))]$
with $d = |i - i(k)|$ on the network
- s and I vary linearly in time so that:


• at t = 0, all cells are equally modified
• at t = t_max, only the central cell is modified


[Figure: intensity ε(t,d) plotted against neighbourhood distance d, with t_max = 24, d_max = 6, I(0) = 0.38, I(t_max) = 1.]


ε as a function of d, for various values of t
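
The shape of this gain can be explored with a short sketch. The values of t_max, d_max, I(0) and I(t_max) are those read from the figure; the endpoints of s(t) are not given in the slides, so `S_START` and `S_END` below are purely hypothetical values chosen to reproduce the two limiting behaviours (nearly flat at t = 0, sharply peaked at t = t_max).

```python
import numpy as np

T_MAX, D_MAX = 24, 6              # values read from the figure
I_START, I_END = 0.38, 1.0        # I(0) and I(t_max), also from the figure
S_START, S_END = 500.0, 0.05      # hypothetical endpoints for s(t)

def linear(t, start, end):
    """Linear interpolation between `start` (t = 0) and `end` (t = T_MAX)."""
    return start + (end - start) * t / T_MAX

def epsilon(t, d):
    """eps(t, d) = I(t) * exp(-d^2 / (2 s(t)))."""
    return linear(t, I_START, I_END) * np.exp(-d ** 2 / (2.0 * linear(t, S_START, S_END)))
```

With these assumed endpoints, `epsilon(0, d)` is almost independent of d, while `epsilon(T_MAX, d)` is negligible for every d other than 0.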


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Results 





Data: reference gel 
Reference gel: network 16x16, t_max = 16 and weights
Reference gel: network 13x13, t_max
7- MULTI LAYER NETWORKS


7-1 Multi-layer networks


Networks with:
external layers: input and output


internal layers: for intermediate processing


[Figure: a multi-layer network, with layers numbered from 0.]


Goal: realize an association $x^k \to Y^k$, k = 1,...,m
Inputs presented on the first layer are propagated:
$x_i = f[A_i]$ with $A_i = \sum_j W_{ij}\,x_j$


----> a computed output on the last layer: $S^k$
7-2 Gradient back propagation algorithm


A cost function is minimized:


$C(W) = \frac{1}{m} \sum_k C^k$ with $C^k = \sum_s [S_s^k - Y_s^k]^2$


through a stochastic gradient method:
$W_{ij}^k = W_{ij}^{k-1} - \varepsilon(k)\,\partial C^k / \partial W_{ij}$


[Figure: layers n-1, n and n+1; computation of the states x (forward) and computation of the gradients (backward).]


Hence the rule for modifying the weights:


$W_{ij}^k = W_{ij}^{k-1} - \varepsilon(k)\, f_i\, x_j$, where $f_i = \partial C^k / \partial A_i$
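Derivation of the GBP algorithm
• Equations for propagation of states:


$x_i = f[A_i]$ with $A_i = \sum_j W_{ij}\,x_j$


• Cost function:
$C(W) = \frac{1}{m} \sum_k C^k$ with $C^k = \sum_s [S_s^k - Y_s^k]^2$
where s indexes the cells in the last layer.


• A Widrow-Hoff rule: $\Delta W_{ih}(k) = -\varepsilon(k)\,\partial C^k / \partial W_{ih}$
but: $\partial C^k / \partial W_{ih} = \partial C^k / \partial A_i \cdot \partial A_i / \partial W_{ih} = \partial C^k / \partial A_i \cdot x_h$


Let us denote: $\partial C^k / \partial A_i = f_i$ (a gradient term, distinct from the transfer function f)
•• if s is an output cell:
$\partial C^k / \partial A_s = 2\,[S_s^k - Y_s^k] \cdot \partial S_s^k / \partial A_s$
since $S_s^k = f(A_s)$: $\partial C^k / \partial A_s = 2\,[S_s^k - Y_s^k] \cdot f'(A_s)$


$f_s = 2\,[S_s^k - Y_s^k] \cdot f'(A_s)$


•• for a hidden unit i (h ranges over the cells that receive input from i):
$\partial C^k / \partial A_i = \sum_h \partial C^k / \partial A_h \cdot \partial A_h / \partial A_i$


but $\partial A_h / \partial A_i = \partial A_h / \partial x_i \cdot \partial x_i / \partial A_i = W_{hi}\, f'(A_i)$
hence $\partial C^k / \partial A_i = f'(A_i) \sum_h W_{hi} \cdot \partial C^k / \partial A_h$
$f_i = f'(A_i) \sum_h W_{hi}\, f_h$
• hence the rule for modifying the weights:
$W_{ih}^k = W_{ih}^{k-1} - \varepsilon(k)\, f_i\, x_h$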
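
The derivation can be checked numerically with a small sketch. It is an illustration only, under assumptions not stated in the slides: the transfer function f is taken to be tanh, there are no bias terms, and the layer sizes, the random weights and the names `forward`, `gradients` and `cost` are hypothetical.

```python
import numpy as np

f = np.tanh                                  # assumed transfer function

def f_prime(a):
    return 1.0 - np.tanh(a) ** 2

def forward(weights, x):
    """Propagate x; returns the totals A and the states x of every layer."""
    states, totals = [x], []
    for W in weights:
        totals.append(W @ states[-1])        # A_i = sum_j W_ij x_j
        states.append(f(totals[-1]))         # x_i = f(A_i)
    return totals, states

def gradients(weights, x, y):
    """dC^k/dW for every layer, with C^k = sum_s (S_s - Y_s)^2."""
    totals, states = forward(weights, x)
    grad_a = 2.0 * (states[-1] - y) * f_prime(totals[-1])   # f_s for output cells
    grads = [None] * len(weights)
    for layer in reversed(range(len(weights))):
        grads[layer] = np.outer(grad_a, states[layer])      # dC/dW_ih = f_i x_h
        if layer > 0:                                        # f_i for hidden cells
            grad_a = f_prime(totals[layer - 1]) * (weights[layer].T @ grad_a)
    return grads

def cost(weights, x, y):
    return np.sum((forward(weights, x)[1][-1] - y) ** 2)

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]  # 4 inputs, 3 hidden, 2 outputs
x, y = rng.normal(size=4), np.array([1.0, -1.0])
grads = gradients(weights, x, y)
```

One learning step is then `weights = [W - eps * g for W, g in zip(weights, grads)]` for a small step size `eps`; comparing `grads` with finite differences of `cost` is a convenient way to validate an implementation.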


Remarks:


- learning requires several sweeps through the
data base.


- convergence might be slow.
- parameters may be hard to adjust:


initial weights: $W^0$

algorithm parameters

parameters of function f

- Results depend on the architecture of the network


- There is no automatic way to determine that architecture.


- Internal cells build up rules characteristic of the
problem.
- The algorithm can be applied to many different
problems:


• associative memories

• automatic classification
• identification

• automatic diagnosis

• signal processing

• image processing

• speech processing


• data compression


The algorithm can be used without a priori
knowledge of the domain.


But such knowledge can be incorporated into
the network architecture for enhanced
performance.
7-3 Comparisons with conventional classifiers
7-3-1 Principal component analysis


Example: data $x^k$ in $\mathbb{R}^2$


Aim of Principal Component Analysis:

determine a subspace $E_p$ of dimension p so that

the projections $y^k$ of the $x^k$ are as dispersed as possible

⇔ find a matrix Y, of rank p, such that $\|X - Y\|$ is minimum
Solution: $Y = U_p \Sigma_p V_p^t$

where: U is the matrix whose columns are the eigenvectors of $X X^t$


V is the matrix whose columns are the eigenvectors of $X^t X$
associated with the eigenvalues $\lambda_1 \ge \dots \ge \lambda_n$

$U_p$ and $V_p$ keep the first p columns, and $\Sigma_p$ is the diagonal matrix of the square roots of the p largest eigenvalues


The projection onto $E_p$ has matrix $U_p U_p^t$
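
A minimal numerical sketch of this solution, using the singular value decomposition from NumPy. The data matrix is random and the function name `pca_rank_p` is hypothetical; the columns of X are taken as the data points, and X is assumed to be already centred.

```python
import numpy as np

def pca_rank_p(X, p):
    """Best rank-p approximation of X (columns are data points) and the projector."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) V^t
    U_p = U[:, :p]
    Y = U_p @ np.diag(s[:p]) @ Vt[:p]                  # Y = U_p Sigma_p V_p^t
    P = U_p @ U_p.T                                    # projection onto E_p
    return Y, P, s

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))                           # hypothetical data, 40 points in R^5
Y, P, s = pca_rank_p(X, p=2)
```

Here `P @ X` coincides with `Y`, and the residual norm of `X - Y` is the root sum of squares of the discarded singular values.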


[Figure: multi-layer network; a subspace of dimension p corresponds to p hidden units.]
7-3-2 Discriminant Analysis


Example: data $x^k$ in $\mathbb{R}^2$, 2 classes.


Aim of Discriminant Analysis:


determine an optimal subspace so that the projections $z^k$ of
the $x^k$ lead to:


compact clusters ----> classes
separated clusters ----> classification


Projection: $z^k = M^t x^k$
Multi-Layer Perceptron with 1 hidden layer of p cells,
where p is the dimension of the optimal subspace for


Discriminant Analysis


[Figure: multi-layer network; a subspace of dimension p corresponds to p hidden units.]


Then:
$W^t$ is a Discriminant Analysis matrix,
i.e. $J(W^t) = J(M)$
EXAMPLE: classify 3 species of irises (FISHER)
Patrick GALLINARI, Sylvie THIRIA, Fouad BADRAN, EHEI


THE CLASSIFICATION OBTAINED BY DISCRIMINANT ANALYSIS:


[Figure: scatter plot of the three species (setosa, versicolor, virginica) in the plane (state of cell 1, state of cell 2).]


real class \ classified as   Setosa   Versicolor   Virginica

Setosa                         50         0            0
Versicolor                      0        48            2
Virginica                       0         1           49


correctly classified: 98%. 
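
The 98% figure follows directly from the table: the correctly classified examples are on the diagonal. A short check (the variable names are illustrative):

```python
import numpy as np

# Rows: real class; columns: predicted class (setosa, versicolor, virginica).
confusion = np.array([[50, 0, 0],
                      [0, 48, 2],
                      [0, 1, 49]])

accuracy = np.trace(confusion) / confusion.sum()   # (50 + 48 + 49) / 150
```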


Three different networks:


- 1 hidden layer of 2 cells, Network 1: 98% correct
- 2 hidden layers of 2 cells each, Network 2: 98.8% correct


- 2 hidden layers of 2 and 3 cells, Network 3: 99.3% correct


Network 2:


[Figure: states of cell 1 and cell 2 of hidden layer 1.]
7-4 How to solve a problem in practice?


Consider a small problem: the recognition of 2 types of
biological underwater acoustic signals, shrimp
and dolphin noises.


We describe all the steps of the process. It can be
decomposed into 2 steps:


- pattern choice and preprocessing


- design and use of neural nets


STEP 1: pattern choice and preprocessing


1 - Building the learning set and the test set


Data are recorded on tape.


The first task is to pick out the interesting
parts of the recording: shrimp and dolphin noises.


There is no automatic method: it must be done "by
hand".
A dolphin noise recording (duration 1 s)
Dolphin noise (duration 0.012 s)
A shrimp noise recording (duration 1 s)
2 - Signal coding


A set of shrimp and dolphin noises has been
recorded in step 1. How should it be used?


Two solutions:
- Use of the raw signal as network input


- Use of a coded (processed) signal as network input


There is no general answer: it depends on the task.


----> Use of the raw signal in another example:
radar classification


7 classes, signal coded in [-1,+1]


[Figure: a radar signal, with values in [-1,+1], plotted against time t.]


95, 30 or 15 points in [-1,+1] obtained by sampling


and after projection.


36 examples for the learning set and 15 for the test set
Networks


[Figure: the network architectures tested.]


Performance: 95% recognition rate on the test set.


Performance on the shrimp, seal and dolphin
recognition task is very low: 73% recognition rate
on the test set.
----> Use of coded signals
Choice of the coding:


- Short-time Fourier transform (FFT): time-frequency
coding.


- Wavelet transform: time-frequency coding.


- Linear predictive coding (LPC): coding of each
pattern as an array of PARCOR coefficients computed for each
temporal window.


In the shrimp and dolphin classification task the
recognition rates were the following:


Coding      Learning set   Test set
FFT         100%           95%
Wavelets    100%           90%
LPC         100%           100%
In one task (radar target classification) the raw signal
allowed good performance. For shrimp and
dolphin classification the best performance is
obtained after preprocessing.


There is no general answer about coding: it depends
on the task.


For step 1, a priori knowledge of the phenomena
is necessary.


Problem of pattern duration


Normalization of pattern duration is necessary,
because the input pattern size is equal to the input layer size,
which is a constant.
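
The slides do not say how the duration was normalized. One simple possibility, shown here only as an assumption, is to resample every pattern to the size of the input layer by linear interpolation; the function name `normalize_duration` is hypothetical.

```python
import numpy as np

def normalize_duration(pattern, input_size):
    """Resample a 1-D pattern to exactly `input_size` points (linear interpolation)."""
    pattern = np.asarray(pattern, dtype=float)
    old_t = np.linspace(0.0, 1.0, num=len(pattern))   # original time axis, rescaled to [0, 1]
    new_t = np.linspace(0.0, 1.0, num=input_size)     # one point per input cell
    return np.interp(new_t, old_t, pattern)

short = normalize_duration(np.arange(10), 4)          # 10 samples squeezed into 4 input cells
long_ = normalize_duration([0.0, 1.0], 3)             # 2 samples stretched to 3 input cells
```

For time-frequency codings the same idea would apply along the time axis of each frequency channel.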
Step 2: Network architecture
1) Building the layers
a) Output layer


In a classification task the output layer is a 1-D
array whose cardinality is the number of classes.


In our problem the output layer is a vector with 2
components. The desired answer is:


- class 1 --> (1,-1)
- class 2 --> (-1,1)


Why 2 components? A one-component vector could
be used, that is:


- class 1 --> (1)
- class 2 --> (-1)
First reason: it works better with 2 components.


Second reason: there is a theoretical background.
Consider an output layer of two components, cell 1 and
cell 2, used for a two-class discrimination. It has
been shown (THIRIA, EHEI) that:


$x_i = P(ex \in C_i)$, $x_i$ the state of cell i
That means that the state of cell i is equal to the


probability that the input pattern ex belongs to
class i.


b) Number of layers
For small problems (input size = O(100)):
3 layers (one hidden layer)
For larger problems:
4 layers (2 hidden layers)
c) Layer sizes


No rules. Logical attitude: decreasing size from
input to output. It can be explained by Data
Analysis too.


2) Building the connections
a) Total connections


Let L and M be 2 adjacent layers. Total
connections means that each cell of layer M is
connected to each cell of layer L.
Experiments show that this choice is not efficient
on real-world problems (input size = O(1000)) for
two reasons:


- Super-computers are needed because of the
high number of connections. Sun4-type
workstations or a VAX 8650 are too small.


- Basic reason: each weight value that is
uncorrelated with the others is a free parameter.


Too many parameters give better learning
but test performance decreases. With not enough
parameters it is impossible to learn the whole
learning set, and test set performance
decreases.


It seems that there usually exists a range of "good"
numbers of connections which allows "optimal"
performance.
b) Local connections


They are used to decrease the number of free
parameters. Each cell of layer M is only connected
to a neighborhood of cells in layer L.


[Figure: a cell of layer M connected to a window of cells in layer L.]


A window (8x3 in the figure) is moved over layer L.


Local feature extraction is expected.
c) Masks (Hinton, Lang and Waibel)


Hinton, Lang and Waibel successfully applied a particular type
of local connections to speech recognition:
they obtained better performance than
classical methods (hidden Markov models).


Their aim was to introduce time into neural nets.


Another analogy exists: the analogy with image
processing.


Let layers L and M be connected via local connections,
as in b).


Cell j ∈ M is connected to a neighborhood of cells
on L by connections $W^j_{il}$.


Diagram representation:
$W^j_{11}$ $W^j_{12}$ $W^j_{13}$
$W^j_{21}$ $W^j_{22}$ $W^j_{23}$
$W^j_{31}$ $W^j_{32}$ $W^j_{33}$


The weight values depend on cell j.
Diagram representation of masks:


$W_{11}$ $W_{12}$ $W_{13}$
$W_{21}$ $W_{22}$ $W_{23}$
$W_{31}$ $W_{32}$ $W_{33}$


Whatever the cell j of layer M, the weights of its connections
to its associated neighbourhood (or window) on L
are the same.


The same technique is used in image processing
for local boundary detection: the same mask is
moved over the image.


In neural nets, masks are expected to extract local
features from layer to layer.


During learning, back-propagation neural nets
compute the optimal mask according to the mean-
square-error minimization.


This sort of connection decreases the number of
free parameters: only a few masks are needed from
layer to layer.
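
To make the saving concrete, the sketch below slides one shared mask over a small layer L and counts the free parameters for total connections, local connections and a single mask. The 6x6 layer, the 3x3 averaging mask and the function name `apply_mask` are hypothetical choices made for the illustration.

```python
import numpy as np

def apply_mask(layer_l, mask):
    """Slide one shared mask over layer L; each output is the total input of one cell of M."""
    rows = layer_l.shape[0] - mask.shape[0] + 1
    cols = layer_l.shape[1] - mask.shape[1] + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            window = layer_l[r:r + mask.shape[0], c:c + mask.shape[1]]
            out[r, c] = np.sum(window * mask)      # same weights whatever the cell j
    return out

layer_l = np.arange(36, dtype=float).reshape(6, 6)  # hypothetical 6x6 layer L
mask = np.ones((3, 3)) / 9.0                        # hypothetical 3x3 mask
totals_m = apply_mask(layer_l, mask)                # 4x4 cells in layer M

n_cells_m = totals_m.size
full = layer_l.size * n_cells_m      # total connections: one weight per pair of cells
local = mask.size * n_cells_m        # local connections: one 3x3 window per cell j
shared = mask.size                   # one mask shared by all cells of M
```

In this toy case the count drops from 576 weights (total connections) to 144 (local connections) and to 9 (one shared mask).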
c) General conclusion on neural net connections


(using the back-propagation algorithm)


i) Input layer to hidden1 layer
and


hidden1 layer to hidden2 layer:


Use masks of growing size to detect and
propagate local features from layer to layer.


The most significant local features are supposed
to be coded in the activities of the hidden2 cells.


ii) Hidden2 to output layer


Full connection. The output layer must be able to sum
up all local features.
d) Improvements: use of hints (Suddarth)


Aim: to introduce some knowledge into the output layer.


Add several cells to the output layer to compute another
discriminant function (or output).


Example 1: to compute a XOR discriminant function,
add a cell which will compute an AND function.


Experiments show that performance increases.


Example 2: in radar target recognition, add cells
which will compute, for example, the number of
engines ...
e) Conclusion on problem solving using back-propagation
neural nets


i) About knowledge of the phenomena:


Step 1 and step 2 can be decoupled. Knowledge is
necessary in step 1 to build and code the learning and
test sets, but not to use the neural nets. Generally step 1
and step 2 are not carried out by the same people.


ii) There is no rule for building a neural net
architecture.
Users need experience to master it. Usually, some
experiments are necessary before building the
"good" network which will achieve the desired
performance.


iii) It is very difficult to understand what a network
really does. Many improvements are still necessary: for
example, how to process temporal signals?


iv) Nevertheless, spectacular results have been
obtained which show that for some tasks neural
nets achieve better performance than other
methods.


Example: for radar and sonar target classification
tasks, neural nets have obtained better results than
Discriminant Analysis.
A network which learns to "talk"
NetTalk [T. Sejnowski]


The problem


"translate" a written text into phonemes.


Phonemes are then passed through a speech synthesizer (DECtalk)
Data coding


Input: a letter is coded on 26 cells (its n°)
+ 3 bits (punctuation)
Output: a phoneme is coded on 23 cells (50 phonemes)


+ 3 bits (accents)
The network


[Figure: layer 0 (7x29 cells), fully connected to layer 1 (80 cells), fully connected to the output layer (26 cells).]
The coding


- 18 000 weights coded on 4 bits ----> 80 000 bits

- which have made it possible to store 20 000 words,
i.e. ≈ 2 000 000 bits

Important compression,


achieved by making use of the redundancies of
pronunciation.
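
The order of magnitude of these figures can be recovered from the layer sizes of the network (7x29 input cells, 80 hidden cells, 26 output cells). The short computation below assumes complete connections between successive layers and ignores any bias terms, which the slides do not mention.

```python
n_input, n_hidden, n_output = 7 * 29, 80, 26      # layer sizes of the network

# Complete connections between successive layers, biases ignored (assumption).
n_weights = n_input * n_hidden + n_hidden * n_output
bits_for_weights = n_weights * 4                   # each weight coded on 4 bits

dictionary_bits = 2_000_000                        # figure quoted for 20 000 words
compression_ratio = dictionary_bits / bits_for_weights
```

This gives 18 320 weights and 73 280 bits, consistent with the rounded values of 18 000 weights and 80 000 bits, that is a compression by a factor of roughly 25 to 30.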


The network has produced an efficient coding.


1- For each letter-sound association


(there are 79), for example:
g-J for gin
i-A for bite
i-I for bit
a-a for father
a-e for bake


the activity of the 80 hidden units is computed


----> a vector in $\mathbb{R}^n$ (n = 80)


2- Average over all occurrences of that association in
the dictionary ----> 79 vectors in $\mathbb{R}^{80}$


3- Apply a hierarchical classification technique to those


vectors.
[Figure: hierarchical classification tree of the letter-sound associations; the labels include k-k and c-k.]
FIRST GRADE


dont go in spring or winter because its too cold. My my brother can go


swimming in the winter though because he gots his tonsils out you know 
and he and he gets sick wh’ sick um once in a few years. I get sick just about 
every day. 


Theres just one thing I cant stand in my family. My baby makes too


much noise I cant even get get to sleep for a minute. He wont stop jumping
around in the bathtub. 

In the bathtub. 

No. In the crib. He he keeps jumping around gets tired then he goes to 
bed then he finally gets to sleep. Cant go to sleep in about a hour. Not with 
that in the house. 

It would just take two minutes to get to sleep. Just about two minutes. If
you just um why dont you get some cotton and plug it in your ears and then 
you cant hear him. 

He makes so much noise he makes so much noise it probably sound 
effect through it. . 

Well what does the baby do. Come out get out crawl out of his crib and 
then come along in your bed and pull out your earplugs. 

Once once he keep jump jumping jumping and then this thing slide 
down and then he fell over to the other bed and he start crying and I 
couldnt get to bed so I I have to wake up put him back in my crib.

In your crib. 

No not in my crib. I dont have a crib. 

You said put him back in your crib. 

I mean in his crib. I dont have a crib.

Uh sometimes I like to go to the I like to go to my grandmothers. I 
would like to sleep over her at her house every day because she lets me stay 
up late about ten oclock or twelve thirty. 

Youre lucky. 

I only get to stay up until eight. 

And I only get to stay up until nine. 

I get to stay up until um say about between ten oclock and nine thirty. 
Uh and sometimes sometimes I get to go to bed at twelve thirty. Sometimes 
but most of the times I dont. On holidays and you know like um weekends. 

On holitinna and I mean on holidays I get to stay up all night. 

Uh on weekends like when Im not going to school see this day I Im
going to school and then the next day you dont have to I can stay up late
because I the next day I can sleep all I want. Thats why we have to go to
bed early on school days.

Every holiday um um my my grandmother and my aunt come over. 

Well you know its because well you know its just about becoming Easter. 
About just twenty days or twenty one.

On Easter I have to get all this gooshy egg. So mushy with the especially
FIRST GRADE


You mean uh um like England or something. When we walk home from
school I walk home with two friends and sometimes we cant run home from
school though. Because um one girl where every time she wants to runs she
gets the wheezes and stuff. And then she cant breathe very well and she gets 
sick. Thats why we cant run. 

I like to go to my grandmothers house. Well because she gives us candy. 
Well um we eat there sometimes. Sometimes we sleep over night there.

Sometime when I go to go to my cousins I get to play soft ball or play
badminton and all that. 


Thing I hate to play is doctor. Oh. I hate to play doctor or house or that. 
Dont like it or stuff. 


Weve been learning a lot of Spanish words. Our teacher speaks Spanish
sometimes.

So does my father.

Well my father doesnt know very much Spanish but he doesnt know
what gray is in Spanish and its gris and he doesnt and he knows what blue
is in Spanish and he knows what um red is. In Spanish. And sometimes I
like to go to Mexico but Ive never been there before. Only when I was a
little teeny baby I been there and I dont even remember it.

There this one night I couldnt get any food. I mean there was this one
day I couldnt get any food at home unless I asked it for Spanish.

My um my mother and father is going to pretty soon take us to 
Philadelphia. And were going to see our grandmother there.

I wish we went to uh we went to Mexico. Not Mexico San Diego once
and they had a little um pool that was full of water and it was two feet. And
then they and then they had another pool it was five feet eight feet. Randy
my brother went in eight feet and I went in five feet and I think there was a
three feet. There was. And I jumped off and I uh and I jumped off the edge
of the swimming pool. I got on the edge and I jumped off and then I holded
on to a edge because I couldnt swim very well.

When I start when I started to swim I I I I was always holding on to the
edge I wouldnt dare to go more than this away from the edge or else I Id Id
start jumping dancing into the water.

When my father wanted to take a picture of me with you know one of
those floating things one of those floating rings that you put around you but
I dont want to because you know I know how to swim. But when I took it
off I almost drownded. And I was jumping up and down to see if I could
swim or not. And

Um I live in an apartment and we have a big pool and its eight and a
half in part and four and a half and three and a half. And this summer I get
to go swimming in it.

In the summer we go swimming. And thats when my birthday is. We 
ASSOCIATIVE MEMORY: TOMOGRAPHY IMAGES
SIMULATED TOMOGRAPHY (C. Obéllianne, G. Galibourg)
- 2-D reconstruction from multiple projection data in medical imaging:
  - x-ray
  - positron emission tomography
  - echotomography
  - NMR...


- detectors measure the projection of an object along a line:


$\mathrm{Proj}(\theta, \mathrm{ray}_u) = \int_{\mathrm{ray}_u} f(u,v)\,dv$


- reconstruction of the original images can be done through:
  - inverse Fourier transform


  - back-projection


Back-projection is fast, but it produces systematic artifacts
WE PROPOSE TO USE CONNECTIONIST NETWORKS
TO HELP REDUCE THOSE ARTIFACTS.


THE ARTIFACTS COME FROM THE LIMITED NUMBER 
OF PROJECTIONS USED FOR RECONSTRUCTION. 


WHEN THE NUMBER OF PROJECTIONS INCREASES 
THE QUALITY OF THE RECONSTRUCTION IMPROVES: 


THE "NOISE" LEVEL CAN BE MEASURED BY THE NUMBER OF PROJECTIONS 


[Figure: uncorrupted image; reconstruction with 5 projections; reconstruction with 10 projections.]
ASSOCIATIVE MEMORY ON TOMOGRAPHY IMAGES: DATA BASE


- 100 prototypes, in 20*20 images and 16 grey levels
- 9 values of the number of projections:
K = 5, 10, 12, 15, 17, 20, 25, 30, 40
- reconstruction algorithm: back-projection (Thomson LCR)


- error is measured by the Normalized Mean Square Error:
$\mathrm{NMSE} = 100 \times \dfrac{\sum_{x,y} e^2(x,y)}{\sum_{x,y} f^2(x,y)}$
with $e(x,y) = g(x,y) - f(x,y)$
where f is the original prototype and g the reconstructed image
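
As a small illustration, the error measure can be written as a function of two images stored as arrays. The function name `nmse` and the toy values are hypothetical.

```python
import numpy as np

def nmse(f, g):
    """Normalized mean square error, in percent, between prototype f and reconstruction g."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    return 100.0 * np.sum((g - f) ** 2) / np.sum(f ** 2)

prototype = np.array([[3.0, 4.0]])        # toy "image"
reconstruction = np.array([[3.0, 5.0]])   # one pixel off by 1
error = nmse(prototype, reconstruction)   # 100 * 1 / 25
```

A perfect reconstruction gives 0, and an all-zero reconstruction gives 100.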


- the NMSE decreases when K increases:


[Figure: NMSE plotted against the number of projections (5 to 40).]


VARIATION OF NMSE WITH THE NUMBER OF PROJECTIONS
THE RESULTS
- learning set:
-- 20 prototypes
-- 3 levels of noise: K = 10, 17, 30


- test set:
-- same 20 prototypes
-- 6 levels of noise: K = 5, 12, 15, 20, 25, 40


- the network:


various architectures have been tested.


[Figure: the network, with full connections between successive layers.]


- the results


-- learning set: 100%
-- test set: 85% on average
- error reduction = NMSE(input) / NMSE(output)
-- is larger for hard tasks (high levels of noise)


-- increases during learning
ERROR REDUCTION ON TRAINING AND TEST SETS


[Figure: error reduction plotted against the number of iterations (30,000 to 130,000), for learning and for generalization.]


[Figure: reconstructed images, 15 projections, with the desired output.]


RECONSTRUCTED IMAGES FOR PROTOTYPES IN LEARNING SET: 15 projections
(best and worst cases)
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VARIATION OF ERROR REDUCTION WITH THE NUMBER OF SWEEPS 


Input pattern Resulting output Desired output 
15 Projections 
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EXAMPLE OF RECONSTRUCTION 
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NOISE REDUCTION 
in tomography images 


- learning set : 40 prototypes and their reconstructions with 15 projections 


- test set : 60 more prototypes reconstructed with 15 projections 


- the network: one hidden layer of 324 cells with 2 masks 3*3
- the results:
-- learning set: 87.5%
-- test set: 6.7%
-- error reduction reaches about 10%


(15% gives correct reconstruction)




RECONSTRUCTED IMAGE FOR A PROTOTYPE IN TEST SET: 15 projections 
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RECONSTRUCTED IMAGE FOR A PROTOTYPE IN TEST SET: 12,17,40 projections 
Hand-written character recognition
(Y. Le Cun, 1987)
data base:
10 digits coded in 16*16 images
12 examples of each digit
(translated in 4 positions)
learning on 320 examples
test on 160
the network:
2 hidden layers
- one with 2 masks of 8*8 cells
- one with 4*4 cells (local connections)
Results:
99.7% on learning set
95% on test set
Other applications:
Recognition of zip codes (AT&T Bell)


Multi-font character recognition
Speech: multi-speaker digit recognition
(L. Bottou, 1989)
data base:
26 speakers (male and female)
each pronouncing 10 digits: 0, 1, 2,...,9
the signal is coded as a 16*65 vector corresponding to:
16 frequency levels on a Bark scale
65 x 12.8 ms time slices (about 800 ms)
training set:
16 speakers,
each digit, randomly translated in 4 positions
(in first 100 ms)
--> 640 INPUT VECTORS
strategy for learning: LEARNING WITH NOISE (gaussian)
the network has 2 hidden layers:
-- the first one has 31 cells implementing 8 masks 16*3


-- the second one has 6 cells implementing 8 masks 8*7


the results:
-- 30 sweeps through the data base (1.30 h on a SUN-4)
-- 99.22% on the learning set
-- 99% on test set (the remaining 10 speakers)
conclusion
-- generality: speaker independence


-- "phonetic" coding in the last layer weights
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THE NETWORK 
EXAMPLE OF OUTPUT (SN simulator)
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